{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# (a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a set of example $\\{x^{(i)}, y^{(i)}\\}_{i=1}^{m}$, the loss function of a logistic regression model is\n",
    "\n",
    "$$\n",
    "J(\\theta) = \\frac{1}{m}\\sum_{i=1}^{m}\\bigg(y^{(i)} \\log (h_{\\theta}(x^{(i)})) + (1 - y^{(i)}) \\log(1 - h_{\\theta}(x^{(i)})) \\bigg)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to maximize the likelihood, we set the derivative w.s.t $\\theta$ to 0, and then obtain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "\\frac{\\partial J(\\theta)}{\\partial \\theta_j}\n",
    "&= \\sum_{i=1}^m\\bigg(\\frac{y^{(i)}}{h_{\\theta}(x^{(i)})} \\frac{\\partial h_{\\theta}(x^{(i)})}{\\partial \\theta_j} + \\frac{y^{(i)} - 1}{1 - h_{\\theta}(x^{(i)})} \\frac{\\partial h_{\\theta}(x^{(i)})}{\\partial \\theta_j}\\bigg) \\\\\n",
    "&= \\sum_{i=1}^m\\bigg(\\frac{y^{(i)} - h_{\\theta}(x^{(i)})}{h_{\\theta}(x^{(i)}) (1 - h_{\\theta}(x^{(i)}))} \\frac{\\partial h_{\\theta}(x^{(i)})}{\\partial \\theta_j} \\\\\n",
    "& = \\sum_{i=1}^m\\bigg(\\frac{y^{(i)} - h_{\\theta}(x^{(i)})}{h_{\\theta}(x^{(i)}) (1 - h_{\\theta}(x^{(i)}))} \\bigg( h_{\\theta}(x^{(i)})(1 - h_{\\theta}(x^{(i)}) \\frac{\\partial (-\\theta^T x^{(i)})}{\\partial \\theta_j} \\bigg) \\bigg) \\\\\n",
    "& = \\sum_{i=1}^m ( y^{(i)} - h_{\\theta}(x^{(i)})) x^{(i)}_j = 0\n",
    "\\end{align*}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* $h$ is the logistic function for calculating probability, i.e. $h(x^{(i)}) = \\frac{1}{1 + e^{-\\theta^T x^{(i)}}}$.\n",
    "* In the 3rd step, we used the fact of derivative for sigmoid function ([a special case of logistic function](https://stats.stackexchange.com/questions/204484/what-are-the-differences-between-logistic-function-and-sigmoid-function)): $\\dot \\sigma = \\sigma(1-\\sigma)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write the derivative in matrix form,\n",
    "\n",
    "$$X(Y - h(X)) = \\boldsymbol{0}$$\n",
    "\n",
    "where $X$ is of shape $n \\times m$ ($n$ features, $m$ examples), $Y - h(X)$ is of shape $m \\times 1$. $h(X)$ is a vector of all probabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of the added intercept terms for each $x^{(i)}$, expand $X$, we have\n",
    "\n",
    "$$\n",
    "\\begin{bmatrix}\n",
    " 1 &  \\cdots & 1 \\\\ \n",
    " x_1^{(1)} & \\cdots & x_1^{(m)} \\\\ \n",
    " \\vdots    & \\ddots  & \\vdots \\\\ \n",
    " x_n^{(1)}& \\cdots & x_n^{(m)}\n",
    "\\end{bmatrix} (Y - h(X)) = \\boldsymbol{0}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider the first row of $X$ only, we have\n",
    "\n",
    "\\begin{align*}\n",
    "[1, \\cdots, 1](Y - h(X)) &= \\sum_{i=1}^{m}(y^{(i)} - h(x^{(i)})) = 0\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "\\sum_{i=1}^{m}h(x^{(i)}) \n",
    "&= \\sum_{i=1}^{m} P(y^{(i)}=1|x^{(i)};\\theta) \\\\\n",
    "&= \\sum_{i=1}^{m}y^{(i)} \\\\\n",
    "&= \\boldsymbol{1} \\{y^{(i)} = 1\\}\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 1st equality is by definition; the 2nd equality is a direct result of the above derivation; the 3rd equality is also by definition."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is equivalent to \n",
    "\n",
    "$$\n",
    "\\frac{\\sum_{i \\in I_{0, 1}} P(y^{(i)} = 1 | x^{(i)}; \\theta)}{\\left | \\{i \\in I_{0, 1} \\} \\right |} = \\frac{\\sum_{i \\in I_{0, 1}} \\boldsymbol{1}\\{y^{(i)} = 1\\}}{\\left | \\{i \\in I_{0, 1} \\} \\right |}\n",
    "$$\n",
    "\n",
    "given $(a, b) = (0, 1)$ and all probabilities should be between $(0, 1)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# (b)\n",
    "\n",
    "1. Perfect calibration doesn't necessarily imply perfect accuracy because caliration is about probability. For example, for a particular $(a, b)$ and a set of test samples, by swapping the predicted probabilities of two samples, it would affect accuracy, but not the calibration.\n",
    "\n",
    "1. Conversely, perfect accuracy shall lead to perfect calibration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# (c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "By including the $L_2$ regularization, the loss function becomes\n",
    "\n",
    "$$\n",
    "J'(\\theta) = J(\\theta) + \\lambda \\left|\\left|\\theta\\right|\\right|^2\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the same logic in (a),"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\frac{\\partial J'(\\theta)}{\\partial \\theta_j} = \\frac{\\partial J(\\theta)}{\\partial \\theta_j} + 2 \\lambda \\theta_j\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In matrix form,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$X(Y - h(X)) + 2 \\lambda \\Theta= \\boldsymbol{0}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Used $\\Theta$ instead of $\\theta$ to be explicit it's a vector (shape: $n \\times 1$)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expanding the left-hand side,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\begin{bmatrix}\n",
    " 1 &  \\cdots & 1 \\\\ \n",
    " x_1^{(1)} & \\cdots & x_1^{(m)} \\\\ \n",
    " \\vdots    & \\ddots  & \\vdots \\\\ \n",
    " x_n^{(1)}& \\cdots & x_n^{(m)}\n",
    "\\end{bmatrix} (Y - h(X)) + 2 \\lambda\n",
    "\\begin{bmatrix} \n",
    "\\theta_0 \\\\\n",
    "\\theta_1 \\\\\n",
    "\\vdots \\\\\n",
    "\\theta_n \\\\\n",
    "\\end{bmatrix}\n",
    "= \\boldsymbol{0}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Taking the first row,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "[1, \\cdots, 1](Y - h(X)) + 2 \\lambda \\theta_0 &= \\sum_{i=1}^{m}(y^{(i)} - h(x^{(i)})) + 2 \\lambda \\theta_0 = 0\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "\\sum_{i=1}^{m}h(x^{(i)})\n",
    "&= \\sum_{i=1}^{m} P(y^{(i)}=1|x^{(i)};\\theta) \\\\\n",
    "&= \\sum_{i=1}^{m}y^{(i)} + 2\\lambda \\theta_0 \\\\\n",
    "&= \\boldsymbol{1} \\{y^{(i)} = 1\\} + 2\\lambda \\theta_0\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence, by including $L_2$ regularization, the prediction will be biased by $c \\theta_0$, where $c$ ($2\\lambda$) is a constant."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
